QTW Case Study 4 - ARIMA Modelling using FluNet data from the World Health Organization

Jayson Barker, Brandon Croom, Shane Weinstock

Business Understanding

The World Health Organization (WHO) has created a publically accessible repository of data specific to Flu prevalance called FluNet. The data providers are all collaborating laboratories / national focal points from different countries, areas or territories networked with the WHO. FluNet is a global web-based tool for influenza virological surveillance first launched in 1997. The data at country level are publically available and updated weekly. The data are provided by National Influenza Centres (NICs) of the Global Influenza Surveillance and Response System (GISRS) and other national influenza reference laboratories collaborating actively with GISRS, or are uploaded from WHO regional databases. Using this data source, our team identified an appropriate data source that contained influenza type A and B prevalance by week and year to perform ARIMA modelling on. The weekly data is well suited for ARIMA testing as it provides us weekly set of flu data that is well suited for ARIMA.

An autoregressive integrated moving average (ARIMA) model is one of the many models used in time series analysis. ARIMA models are applied in cases where data show evidence of non-stationarity. Non-stationarity typically occurs when a the data points of the process being modeled are dependent upon time. An ARMINA model is essentially a combination of AR and MA models. The AR part of ARIMA indicates that the variable of interest is regressed on its own prior values. The AR model order is expressed by the value p. The MA part indicates that the regression error is a linear combination of error terms whos values occurred at various times in the past. The MA model order is expressed by the value q. The I indicates that the data values have been differenced and indicates the number of differences. The I order of the model is indicated by the value d. Typically ARIMA models are denoted as ARIMA(p,d,q), where the p value represents the order of the AR model portion, d represents the order of the I portion of the model, and the q value represents the order of the MA model portion.

Another model that is used in against data that is seasonal is the seasonal auto regressive integrated moving average with exogenous factors model, also known as the SARIMAX model. SARIMAX modeling is similar to ARIMA modeling and maintains the same types of p, d, and q values for AR, I and MA respectively. SARIMAX also introduces an s component, which indicates the seasonality of the data. The difference between ARIMA and SARIMAX is the seasonality and exogenous factors.

The parameters we used to extract this data include all influenza type A & B records from January 2016 through September 2020. This provides us a robust data set that includes influenza type A & B positive case counts as well as negative case counts. The overall goal of this report is to show that influenza data can be leveraged within an ARIMA model.

Exploratory Data Analysis

In this section we will discuss the steps and results found as part of the exploratory data analysis (EDA). During EDA we investigate the data looking for data anomalies, missing data and any other data work that needs to occur prior to model building. We first start by loading the data that was extracted from the FluNet website. The data is focused on the US only and covers the 2016 - 2020 timespan

Leveraging the Pandas Profiling tool provides a wealth of information on the data set and helps to speed up the EDA. Summary observations from pandas profiling include:

- There are 22 variables

- There are 246 observations

- Only 5.2% of the data set has missing values

Digging into the specific fields that will be required for our analysis. There are key fields that we must pay special attention to: ALL_INF, SDATE and EDATE.

- ALL_INF is our column of interest as it includes the number of positive cases identified during that specific week of the year.

- SDATE is the start date of the week the data was reported

- EDATE is the end date of the week the data was reported

For these three fields pandas profiling indicates the follwing information:

- SDATE has no missing values, is 100% distinct and is a categorical variable

- EDATE has no missing values, is 100% distinct and is a categorical variable

- ALL_INF has no missing values, is 93.5% distinct and is a numerical variable

All of this information allows us to understand the following about the data:

- There are no duplicate weeks that need to be addressed

- There is no missing data in the fields required for analysis that need to be addressed

- SDATE and EDATE will need to be converted to datatime data types to perform time series analysis

ARIMA Analysis

In this section we will begin the ARIMA analysis. In looking at the plot of the data shown above, we see what looks to be seasonal values. This would lead us to believe the data may not be stationary. Stationarity can be tested in different ways and as such we'll look at three differnt statiionarity tests to get a determination on stationarity. We'll look at three different tests:

- Augmented Dickey-Fuller (ADF) Test

- Phillips-Peron (PP) Test

- Kwiatkowski=Phillips=Schmidt-Shin (KPSS) Test

The results from the stationarity tests seem to indicate the data is non-stationary. The ADF and the KPSS test is detecting that the data is stationary, while the PP method is indicating that the data is non-stationary. Given the domain knowledge available on flu data, it would seem that treating the data as non-stationary would be a good first test.

In the plots below, we perform our first run of auto-correlation to evaluate lags and what parameters may make the most sense for our ARIMA model. In the original series, we see peaks left plot that seem to indicate a seasonal type pattern. The autocorrelation plot shows a cyclic like pattern that typically is representative of seasonal values. In order to remove the seasonality from the data, differencing is needed. Differencing is a method of transforming a time series analysis to remove dependence on time. The second plot below shows the first difference being taken on the data. In the left plot we see some of the seasonality, but the data is starting to move more toward a less time dependent mean. Differencing a second time appears to have the best alignment to the mean, which sets our lag at 3 - right above the cutoff per the autocorrelation visual.

Here we perform our auto-correlation testing for additional ARIMA parameters.

In this section, we evaluate auto-correlation plot to determine the q value.

Using the values defined above for P, D, and Q, we run our ARIMA model with values, 1, 1, and 3. This identifies that the constant is significant, and the moving average is not closed to 0. We can also see that the pvalue of AR and MA terms are highly significant at < 0.05, which indicates this model is sound.

The residuals plot below shows the distribution of the mean about 0 - and also does not indicate a shift in the mean (i.e., a constant mean is present) which gives us additional confidence in our ARIMA model.

Fitting the actual data onto our predicted ARIMA models demonstrates a good fit. This shows that the ARIMA(2,0,0) model is a good approximation of the model.

Another option to determine the p,d,q values of the ARIMA model is to leverage auto ARIMA functionality. With auto ARIMA the intent is to execute multiple models with multiple values of p,d,q to determine the best model configuration for the data. The output of the auto ARIMA execution is shown below. The auto ARIMA function allows for testing against the same stationarity testing models to build out an ARIMA model. For purposes of determining the best model using auto ARIMA we'll leverage all three tests

The output of the auto ARIMA shows that an ARIMA(2,0,0) is deemed the best model when using the ADF and KPSS stationarity test. This is based off of the AIC value of 4156.330 being the lowest of the models executed. The auto ARIMA output using the PP stationarity test was an ARIMA(2,1,0). Logically this model selection makes sense as we know there is seasonality in the data. In the output it is also seen that the SARIMAX(2,1,0) model looks like a viable model and has the same AIC as the ARIMA model. This helps to reinforce the similarities between ARIMA and SARIMAX models. Between the three models the ARIMA(2,1,0) also has the lowest AIC at 4150.193. Not a huge difference from the other two but lower none the less.

In order to test the forecasting of the ARIMA we will create a train/test split of the data and reexecute the model. The predictions from the model will become the forecast. This approach will allow us to see how close the model forecasts. Below we'll build out our train/test split and plot the training and test splits to visualize which part of the data the forecast will go against.

Next we'll execute the auto ARIMA against the training data set to determine which model is best suited for this analysis. As we can see from the model output below, as with the full data set the auto ARIMA selected the ARIMA(2,0,0).

Utilizing the prediction capability of the model we can test to see how well the model fits the test portion of the data. The plot below shows the output of those results. From the plot we see the prediction mimics the test set fairly well. It follows the downward slope but does begin to trend back up earlier than the test data.

Conclusion

In conclusion, leveraging the flu data for the US from 2016-2020 we have been able to validate that time series modeling can be performed on the data and achieve reasonable results. Over the course of this report we have shown the following items:

- Check for stationarity within a data set to determine how to approach modeling

- Determining manually the best p,d,q values for an ARIMA model

- Forecasting of the manual model

- Determining through auto ARIMA the best p,d,q values for an ARIMA model

- Forecasting of the auto ARIMA model